This report analyzes loan data from Prosper. The Prosper loan data (pld) covers more than 110,000 loans, with 81 variables describing each loan.

Prosper Loan Data

## 'data.frame':    113937 obs. of  82 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Date, format: "2007-08-26" "2014-02-27" ...
##  $ CreditGrade                        : Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: 4 NA 1 NA NA NA NA NA NA NA ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Date, format: "2009-08-14" NA ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Date, format: "2007-08-26" "2014-02-27" ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Date, format: "2001-10-11" "1996-03-18" ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Ord.factor w/ 8 levels "Not employed"<..: 4 5 8 4 7 7 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Date, format: "2007-09-12" "2014-03-03" ...
##  $ LoanOriginationQuarter             : Ord.factor w/ 35 levels "Q4 2005"<"Q1 2006"<..: 8 34 6 29 32 33 31 31 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
##  $ ListingCategory                    : Factor w/ 21 levels "Not Available",..: 1 3 1 17 3 2 2 3 8 8 ...

Univariate Plots Section

The plot above shows a histogram of loan start dates. A bin width of 30 days gives approximately the number of loans started each month. There is a gap around 2009, after which the number of loans starts to increase again. I am not sure whether the data is representative of the overall financial market. The increase in the number of loans may reflect growth in Prosper's business rather than an overall increase in loan requests.

The information recorded for loans originated before July 2009 differs from that for loans originated after July 2009. To keep the analysis consistent, I decided to consider only loans originated after July 2009.
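The date filter described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the code used for this report: the name `toy` and the exact cutoff date are assumptions, since the report only says "after July 2009".

```r
# Toy stand-in for the loan data (hypothetical values); the real column is LoanOriginationDate.
toy <- data.frame(
  LoanOriginationDate = as.Date(c("2007-09-12", "2009-06-30", "2009-08-01", "2014-03-03")),
  LoanOriginalAmount  = c(9425, 3000, 5000, 10000)
)

# Assumed cutoff: keep loans originated on or after 2009-08-01.
cutoff   <- as.Date("2009-08-01")
toy_post <- subset(toy, LoanOriginationDate >= cutoff)

nrow(toy_post)
```

Keeping the cutoff in a single variable makes it easy to check how sensitive the later results are to the exact date chosen.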

## 
##              Cancelled             Chargedoff              Completed 
##                      0                   5342                  19786 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   1008                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

The table above shows the number of loans in each status.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    9076   13500   35000

The distribution of loan amounts is positively skewed. However, with a log scale on the x axis, the histogram looks approximately normal. Note that most loans are for round amounts, and the distribution shows peaks at common loan values of $4,000, $10,000, and $15,000. The smallest loan is $1,000 and the largest loan is $35,000.
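To illustrate the effect of the log scale, here is a small sketch on simulated data. The amounts are drawn from a log-normal distribution purely for illustration (an assumption, not the Prosper data), clipped to the $1,000 to $35,000 range reported above.

```r
set.seed(1)

# Simulated, roughly log-normal loan amounts (hypothetical), clipped to the reported range.
amount <- pmin(pmax(round(rlnorm(1000, meanlog = log(7500), sdlog = 0.7)), 1000), 35000)

# Histogram counts on the raw scale (right-skewed) and on a log10 scale (more symmetric).
h_raw <- hist(amount, breaks = 30, plot = FALSE)
h_log <- hist(log10(amount), breaks = 30, plot = FALSE)

# A simple skew check: on the raw scale the mean sits above the median.
c(mean = mean(amount), median = median(amount))
```

Setting `plot = TRUE` (the default) draws the two histograms; a smaller bin width on the raw scale is what reveals the peaks at round amounts.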

Loan term

## 
##    12    36    60 
##  1614 58825 24545

Most loans have a 36-month term, some have a 60-month term, and a small number have a 12-month term.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.583  16.320  21.940  22.660  29.250  42.400

The figure above shows the distribution of the Annual Percentage Rate (APR) for the loans. The original values are stored as fractions rather than percentages, so I multiplied them by 100 to make them easier to read.

Unsurprisingly, monthly payments follow the same pattern as the original loan amount.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3433    5000    5930    7083 1750000

The monthly income of borrowers varies, with a median of $5,000. While the maximum recorded monthly income is $1,750,000, it is most probably a mistake (I cannot imagine someone with that much income taking a $10,000 loan), and most incomes are less than $10,000 a month. Stated monthly income is provided by the borrower, so its reliability is not clear.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.150   0.220   0.259   0.320  10.010    7307

The average debt-to-income ratio is around 0.25, meaning borrowers generally carry a quarter of their income as debt. While 75% of borrowers have a debt-to-income ratio below 0.32, some borrowers have a ratio close to 1. A small number of borrowers have a reported ratio of 10.01, which in fact means the actual debt-to-income ratio is greater than 10.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   619.0   679.0   719.0   718.4   739.0   899.0

In the dataset only the credit score bucket is reported, with the lower and upper bounds in separate variables. The histogram here is for the upper bound of each individual's credit score range. Overall, the credit score follows a roughly normal distribution with a slight positive skew.

The graphs above show the distribution of the borrowers’ credit grades. Overall, the middle credit grades account for a higher proportion of loans.

Loan purpose

## 
##      Not Available Debt Consolidation   Home Improvement 
##                 20              53246               6812 
##           Business      Personal Loan        Student Use 
##               5315                  0                280 
##               Auto              Other      Baby&Adoption 
##               2244               9242                199 
##               Boat Cosmetic Procedure    Engagement Ring 
##                 85                 91                217 
##        Green Loans Household Expenses    Large Purchases 
##                 59               1996                876 
##     Medical/Dental         Motorcycle                 RV 
##               1522                304                 52 
##              Taxes           Vacation      Wedding Loans 
##                885                768                771

This variable is the category chosen by the borrower as the reason for the loan. By far, most loans are for debt consolidation (probably credit card debt). Excluding the catch-all Other category, home improvement and business are the next most common reasons for borrowing money.

Univariate Analysis

What is the structure of your dataset?

There are 113,937 observations, each with 81 variables describing a loan. To ensure data consistency I removed loans originated before July 2009, leaving around 85,000 observations. Some variables describe the loan itself (LoanOriginalAmount, LoanOriginationDate, Term, BorrowerAPR, MonthlyLoanPayment, etc.). Others describe the borrower's circumstances as provided by the borrower, such as ListingCategory, StatedMonthlyIncome, and Occupation. The majority of the variables describe the credit status and history of the borrower (CurrentCreditLines, TotalCreditLinespast7years, DelinquenciesLast7Years, RevolvingCreditBalance, etc.). These variables define the risk associated with the borrower and possibly determine the borrower's credit score (CreditScoreRangeUpper and CreditScoreRangeLower).

There are four types of variables in the dataset:
- Date variables, which were converted to R's Date class using the as.Date function.
- Numbers, including dollar amounts, rates, and integer counts.
- Factors for variables such as CreditGrade, Occupation, and IncomeRange, as well as boolean variables.
- Unique IDs, which can be integers or alphanumeric strings.

It is worth noting that almost all variables are missing some values. However, the number of missing values is small compared to the total number of observations and should not affect the integrity of the conclusions.
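One way to check how much is missing per variable is to count NA values column by column. The sketch below uses a toy data frame with hypothetical values; in the real data, for example, the DebtToIncomeRatio summary above reports 7,307 NA values.

```r
# Toy data frame (hypothetical values) with a few missing entries.
toy <- data.frame(
  DebtToIncomeRatio  = c(0.17, NA, 0.06, 0.15),
  CurrentCreditLines = c(5L, 14L, NA, NA),
  Term               = c(36L, 36L, 60L, 36L)
)

# Number and share of missing values in each column.
na_count <- colSums(is.na(toy))
na_share <- na_count / nrow(toy)

na_share
```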

What is/are the main feature(s) of interest in your dataset?

The main interest is to identify what affects APR. It is clear that CreditGrade and APR are directly related, but it would be interesting to see whether they have a one-to-one relation, and how CreditGrade can be derived from the credit history.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Besides CreditGrade, I would like to investigate whether any of the following impact the APR: income, loan purpose, loan date, loan amount, and occupation. I would also like to see which aspects of the credit history impact the CreditGrade the most. The variables of interest are: IsBorrowerHomeowner, FirstRecordedCreditLine, CurrentCreditLines, OpenRevolvingAccounts, OpenRevolvingMonthlyPayment, RevolvingCreditBalance, DebtToIncomeRatio, and BankcardUtilization.

Did you create any new variables from existing variables in the dataset?

I created a new variable, ‘CreditHistoryLength’, which is the time difference in years between the loan date and the first recorded credit line.
I added another variable, ‘CreditLoanRatio’, which is the ratio of AvailableBankcardCredit to LoanOriginalAmount.
I also created a new variable that holds the actual name of the loan category instead of the integer code in the dataset.
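The three derived variables could be computed along these lines. This is a sketch on a toy data frame, not the report's actual code: the use of 365.25 days per year is an assumption, and only the first three category codes are mapped, with labels taken in the order they appear in the loan purpose table above.

```r
# Toy rows (hypothetical values) with the source columns used for the derived variables.
toy <- data.frame(
  LoanOriginationDate       = as.Date(c("2010-01-15", "2014-03-03")),
  FirstRecordedCreditLine   = as.Date(c("2000-01-15", "1996-03-18")),
  AvailableBankcardCredit   = c(1500, 10266),
  LoanOriginalAmount        = c(3000, 10000),
  ListingCategory..numeric. = c(1L, 2L)
)

# Credit history length in years (assumes 365.25 days per year).
toy$CreditHistoryLength <- as.numeric(
  difftime(toy$LoanOriginationDate, toy$FirstRecordedCreditLine, units = "days")
) / 365.25

# Ratio of available bankcard credit to the loan amount.
toy$CreditLoanRatio <- toy$AvailableBankcardCredit / toy$LoanOriginalAmount

# Readable category names for the integer codes (only codes 0 to 2 shown here).
category_labels <- c("Not Available", "Debt Consolidation", "Home Improvement")
toy$ListingCategory <- factor(toy$ListingCategory..numeric.,
                              levels = 0:2, labels = category_labels)

toy[, c("CreditHistoryLength", "CreditLoanRatio", "ListingCategory")]
```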

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some variables, such as CreditGrade and IncomeRange, are expected to be ordered factors but did not have the proper ordering, so I changed the order of their levels.
Also, the date variables were imported as strings stored in factors. They needed to be converted to R Date variables to work with effectively. I used the as.Date function to convert them. The time-of-day part of each value was ignored.
The loan amount distribution is positively skewed, and log-transforming the loan amount yields an approximately normal distribution.
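The date conversion and factor reordering can be sketched as below. The timestamp format in `raw` is an assumption about how the strings look in the CSV; the level order for CreditGrade follows the ordering shown in the structure listing and rating table (HR lowest, AA highest).

```r
# Toy raw import (hypothetical rows): dates and grades both arrive as factors of strings.
raw <- data.frame(
  ListingCreationDate = c("2007-08-26 19:09:29", "2014-02-27 08:28:07"),
  CreditGrade         = c("C", "HR"),
  stringsAsFactors    = TRUE
)

# Convert factor -> character -> Date; text after the date part (time of day) is ignored.
raw$ListingCreationDate <- as.Date(as.character(raw$ListingCreationDate),
                                   format = "%Y-%m-%d")

# Rebuild CreditGrade as an ordered factor, from worst to best grade.
grade_levels <- c("HR", "E", "D", "C", "B", "A", "AA")
raw$CreditGrade <- factor(as.character(raw$CreditGrade),
                          levels = grade_levels, ordered = TRUE)

str(raw)
```

With an ordered factor, comparisons such as `raw$CreditGrade >= "C"` work directly, and plots show the grades in a meaningful order.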

Bivariate Plots Section

The plots above show the relation between selected variables. The first row shows the relation between APR and the other variables. The impact of credit rating on APR is very clear. There is also considerable correlation between APR and credit score, loan amount, and bankcard utilization.
Looking at credit rating and loan amount, it seems people with lower credit ratings only qualified for smaller loans. Loan amounts also seem to have generally increased over the years. This might be related to the growth of Prosper's business rather than borrowers' demand for larger loans.
The relation between APR, credit rating, and the other variables needs to be examined in more detail.

## # A tibble: 8 × 4
##   ProsperRating..Alpha.   APR.mean APR.median     n
##                   <ord>      <dbl>      <dbl> <int>
## 1                    HR 0.35606120    0.35797  6935
## 2                     E 0.33055055    0.33215  9795
## 3                     D 0.28058055    0.28488 14274
## 4                     C 0.22612440    0.22362 18345
## 5                     B 0.18403003    0.18173 15581
## 6                     A 0.13890939    0.13799 14551
## 7                    AA 0.09004073    0.09000  5372
## 8                    NA 0.18688130    0.17018   131

As suspected previously, the main variable describing the APR is the ProsperRating. This variable seems to be the one that Prosper uses to choose the APR for its customers. Therefore, it is likely not available before a loan is started.

The graphs above show the loan APR split into separate panels by the borrower's credit grade. These panels give more detail about how the APR distribution depends on the credit grade. The distributions validate the conclusions drawn from the previous boxplots. It is also clear that APR generally has lower variance for borrowers with better credit grades.

The graph above presents the credit score distribution for the various credit grades. While borrowers with higher credit grades generally have higher credit scores, it seems the credit score alone does not determine the credit grade of the borrower.
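A per-rating summary like the table above (mean APR, median APR, and count) can be reproduced in outline as follows. The tibble output suggests the report used dplyr; this sketch deliberately uses base R instead, on a toy data frame with hypothetical values and only three rating levels.

```r
# Toy data (hypothetical values): a rating and an APR per loan.
toy <- data.frame(
  ProsperRating = factor(c("HR", "HR", "A", "A", "A", "AA"),
                         levels = c("HR", "A", "AA"), ordered = TRUE),
  BorrowerAPR   = c(0.35, 0.36, 0.13, 0.14, 0.15, 0.09)
)

# Split APR values by rating, then summarise each group.
groups <- split(toy$BorrowerAPR, toy$ProsperRating)
apr_by_rating <- data.frame(
  ProsperRating = names(groups),
  APR.mean      = sapply(groups, mean),
  APR.median    = sapply(groups, median),
  n             = sapply(groups, length),
  row.names     = NULL
)

apr_by_rating
```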

## 
##  Pearson's product-moment correlation
## 
## data:  pldn$CreditScoreRangeUpper and pldn$BorrowerAPR
## t = -180.4, df = 84982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.531077 -0.521354
## sample estimates:
##        cor 
## -0.5262327

As expected, there is a fairly strong negative correlation (about -0.53) between credit score and APR: as the credit score decreases, the APR increases.

Over the years, the maximum loan interest rate has slightly decreased, but overall interest rates do not seem to be affected significantly by time.

Higher loan amounts seem to have lower APR. However, it is unlikely that the APR is lower because the loan is larger. I suspect people with lower credit scores do not qualify for larger loans; therefore, we do not see large loans with high APR. In general, people with better credit scores (who would get a lower APR) can get the larger loans.
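For readers unfamiliar with the output above, the sketch below shows how such a test is run with `cor.test` and how the reported t statistic follows from the correlation and the degrees of freedom, t = r * sqrt(df) / sqrt(1 - r^2). The simulated `score` and `apr` vectors are hypothetical; only the last three lines use numbers taken from the output above.

```r
set.seed(42)

# Simulated data (hypothetical): APR falls as the credit score rises, plus noise.
score <- round(runif(500, 620, 900))
apr   <- 0.60 - 0.0005 * score + rnorm(500, sd = 0.05)

# Pearson correlation test (the default method of cor.test).
ct <- cor.test(score, apr)
ct$estimate    # sample correlation
ct$conf.int    # 95 percent confidence interval
ct$parameter   # degrees of freedom, n - 2

# Consistency check using the values reported above for credit score vs. APR.
r       <- -0.5262327
df      <- 84982
t_check <- r * sqrt(df) / sqrt(1 - r^2)
t_check        # close to the reported t = -180.4
```

With around 85,000 observations, even weak correlations produce very small p-values, so the size of the correlation is more informative than the p-value.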

## 
##  Pearson's product-moment correlation
## 
## data:  pldn$CurrentCreditLines and pldn$BorrowerAPR
## t = -32.039, df = 84982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1158856 -0.1025995
## sample estimates:
##        cor 
## -0.1092474

There seems to be a weak negative relation (correlation about -0.11) between the number of credit lines and APR: people with more credit lines tend to have a lower APR, and people with a high APR tend to have few credit lines. One possible explanation is that people who have access to a credit line would only consider the loan if its APR is lower than the rate on their credit line, whereas people without a credit line have no choice but to take the loan at a high APR.

## 
##  Pearson's product-moment correlation
## 
## data:  pldn$AvailableBankcardCredit and pldn$BorrowerAPR
## t = -116.75, df = 84982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3775567 -0.3659686
## sample estimates:
##        cor 
## -0.3717771

Similar to having credit lines, someone who has access to credit through a bank card (a credit card, I presume) would likely choose a loan only if the rate is favourable. Individuals without access to credit through a bank card have no choice but to accept high-APR loans.

## 
##  Pearson's product-moment correlation
## 
## data:  pldn$AvailableBankcardCredit and pldn$LoanOriginalAmount
## t = 74.539, df = 84982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2414024 0.2540238
## sample estimates:
##       cor 
## 0.2477236

People who have access to more credit through bank cards usually take larger loans, likely because they use their bank cards for smaller amounts.

Looking at the histogram of the APR for the two cases (loans higher and lower than the available bank card credit), it is clear that the average APR is higher when people take loans larger than their available bank card credit.
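The two-group comparison can be sketched by flagging loans that exceed the borrower's available bankcard credit and averaging APR within each group. The rows and the column name `LoanExceedsCredit` are hypothetical and chosen only to show the mechanics.

```r
# Toy data (hypothetical values).
toy <- data.frame(
  LoanOriginalAmount      = c(3000, 15000, 10000, 4000),
  AvailableBankcardCredit = c(10266, 695, 1500, 30754),
  BorrowerAPR             = c(0.12, 0.28, 0.25, 0.10)
)

# TRUE when the loan is larger than the credit available through bank cards.
toy$LoanExceedsCredit <- toy$LoanOriginalAmount > toy$AvailableBankcardCredit

# Mean APR in each group; names are "FALSE" and "TRUE".
mean_apr <- tapply(toy$BorrowerAPR, toy$LoanExceedsCredit, mean)
mean_apr
```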

## 
##  Pearson's product-moment correlation
## 
## data:  pldn$BankcardUtilization and pldn$BorrowerAPR
## t = 74.523, df = 84982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2413528 0.2539745
## sample estimates:
##       cor 
## 0.2476742

Besides having credit available through a bank card, how heavily the bank card is already utilized also matters when deciding whether to use the bank card or get a loan. People with higher bank card utilization have no choice but to accept a higher APR.

## 
##  Pearson's product-moment correlation
## 
## data:  pldn$InquiriesLast6Months and pldn$BorrowerAPR
## t = 78.475, df = 84982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2536616 0.2661996
## sample estimates:
##       cor 
## 0.2599415

There is some correlation between recent inquiries and APR. A likely explanation is that more inquiries indicate that the borrower has been declined for other loans and has no other options.

## 
##  Pearson's product-moment correlation
## 
## data:  pldn$TradesNeverDelinquent..percentage. and pldn$BorrowerAPR
## t = -82.155, df = 84982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2774715 -0.2650143
## sample estimates:
##        cor 
## -0.2712543

There is a noticeable negative correlation (about -0.27) between the percentage of TradesNeverDelinquent and APR. People who have a good track record of paying off their debts are more likely to get a better APR.

## 
##  Pearson's product-moment correlation
## 
## data:  pldn$CreditHistoryLength and pldn$BorrowerAPR
## t = -23.05, df = 84982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08550195 -0.07213895
## sample estimates:
##         cor 
## -0.07882399

There is not much correlation between credit history length and APR. This is because most individuals have a long credit history. But, focusing on individuals with a very short credit history, we see that a short credit history results in a higher APR.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

My aim is to describe the APR using the variables that capture the borrower's credit history. Although ProsperRating explains much of the APR, I expect it not to be available for a new individual before getting a loan. Among the variables expected to be openly available, the strongest correlation is between APR and credit score. A higher credit score results in a lower APR, which is reasonable. However, the credit score alone does not account for all the variation in APR.
Other variables such as AvailableBankcardCredit and BankcardUtilization reflect the borrower’s access to other sources of credit and affect their APR.
The loan amount and loan starting date do not seem to impact the APR. Although the APR for larger loans is lower on average, it is unlikely that asking for a bigger loan would result in a lower APR. I suspect the lower APR is the result of those individuals having a better credit history.
Other variables such as delinquencies and credit inquiries are also associated with a worse (higher) APR.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

On average, the loan amount seems to be consistent with the credit individuals have access to through bank cards. People who have more credit through bank cards get bigger loans from Prosper. Additionally, when the loan amount is lower than the credit available to someone through bank cards, that individual tends to get a better APR than when the loan amount is higher than the credit available through bank cards.

What was the strongest relationship you found?

The strongest relationship is between APR and ProsperRating, with APR decreasing with higher ProsperRating.

Multivariate Plots Section

My aim is to describe the APR using the different variables available through the credit history. We identified that credit score is the most significant factor describing the APR. Now we look at whether other variables can describe the APR variation within similar credit scores.
Most variables in this process are continuous; however, it is hard to see variation in color with continuous variables. I used the ‘cut’ function to break the continuous variables into factors so that the color changes are easier to see.
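As a brief illustration of binning with `cut`, the sketch below splits a few bankcard utilization values into four bins. The break points are assumptions chosen for the example, not necessarily the ones used for the plots in this section.

```r
# A few utilization values (taken from the first rows of the structure listing).
util <- c(0, 0.21, 0.04, 0.81, 0.39, 0.72)

# Break the continuous variable into four bins; include.lowest keeps the value 0.
util_bin <- cut(util, breaks = c(0, 0.25, 0.5, 0.75, 1), include.lowest = TRUE)

table(util_bin)
```

The resulting factor can be mapped to a discrete color scale, which is easier to read than a continuous gradient when points overlap heavily.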

The number of credit lines does not describe the variations in the APR within similar CreditScoreRange values.

We saw before that people with lower bankcard utilization have higher credit scores. This is again very prominent in this graph. However, within the same CreditScoreRange, we can see that people with lower BankcardUtilization generally have a lower APR.

Again, there is significant correlation between CreditLoanRatio and CreditScoreRange, i.e. people with a higher CreditScoreRange have a higher CreditLoanRatio. The impact of CreditLoanRatio on APR is not very clear for people with very high or very low credit scores. However, for people with average credit scores (around 740 to 800), we can see that a higher CreditLoanRatio generally leads to a lower APR.

The impact of AvailableBankcardCredit is very similar to that of CreditLoanRatio. Given that the loan amounts follow a normal distribution, this is expected. However, it is not clear which one is the primary feature driving the APR.

Contrary to my expectations, the percentage of TradesNeverDelinquent does not seem to have much impact on the APR within similar CreditScoreRange values. One explanation could be that the effect of TradesNeverDelinquent is already accounted for in CreditScoreRange. Therefore, it cannot describe any more variation in APR.

Looking at the impact of InquiriesLast6Months on APR, we can see that people who have 0 or 1 inquiries in the last 6 months on average get a better APR than people with more inquiries in the same CreditScoreRange.

From the graph above we can see that DebtToIncomeRatio partially impacts the APR. Within the same CreditScoreRange, people with a lower DebtToIncomeRatio receive a lower APR for their loans.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When I looked at TradesNeverDelinquent versus APR, there was considerable correlation between the two. However, this correlation seems to be mostly captured by the credit score, i.e. within the same credit score range, TradesNeverDelinquent cannot describe the APR.

On the other hand, AvailableBankcardCredit significantly impacts APR, even for people in the same credit score range. The same pattern also appeared for the credit-to-loan ratio.

Two other variables that showed a significant impact on APR were InquiriesLast6Months and DebtToIncomeRatio. When lower, both would result in a lower APR for the customer.

Final Plots and Summary

Plot One

Description One

Here we have the distribution of loan amounts. If we use large bins with a logarithmic x axis, we see that the loans follow an approximately normal distribution. However, when we use smaller bins, we see another phenomenon. Although overall the loans are normally distributed, loan amounts are generally round numbers, with 4000, 10000, and 15000 being the most common.

Plot Two

Description Two

This plot shows the loan’s APR with respect to the rating that Prosper gave to the borrower. There is a direct and strong relationship between the rating and APR. Furthermore, as the ratings worsen, the APR also shows more variation.

Plot Three

Description Three

The plot above depicts the loan’s APR as a function of the borrower’s credit score. The points are colored based on the credit available to the borrower through bankcards. The first conclusion is that higher credit scores yield a lower loan APR for the borrower. Second, people with higher credit scores have more access to credit through bankcards. Third, among people with similar credit scores, those who have access to more credit through bankcards can obtain a loan with a lower APR.

Reflection

We looked at the loan data from Prosper. There are more than 110,000 loans in this dataset. However, some of the data features were different before and after July 2009. To ensure a consistent analysis, only data for loans after July 2009 was used in this report. I needed to condition some of the variables, such as the date variables and factor variables, to work with them more easily. I started by exploring some interesting features such as loan amount, credit score, APR, and Prosper rating. Eventually, I decided to look into the relationship between APR and other features, i.e. how Prosper decides what APR a borrower gets based on their history and current status.

The strongest correlation was between APR and the rating that Prosper assigns to borrowers. However, I decided not to use the Prosper rating, as I suspected it is a variable that Prosper calculates and that individuals likely do not have access to. Besides the Prosper rating, the feature that described the APR the most was the credit score. On average, a higher credit score corresponds to a lower APR. However, the credit score does not completely describe the APR variation. Other variables that I found to be impacting the APR were: DebtToIncomeRatio, InquiriesLast6Months, AvailableBankcardCredit, and BankcardUtilization.

Contrary to my expectation, I could not find a clear impact on the APR from the following variables: CurrentCreditLines, TradesNeverDelinquent, and LoanOriginationDate. I did not try to isolate the impact of other variables before assessing these variables, so it is possible that their impact is masked by variation due to other features.